Optimization Algorithm of Moving Object Detection Using Multiscale Pyramid Convolutional Neural Networks

Object detection and recognition is an important topic with significant research value. This research develops an optimised model of moving target identification based on a convolutional neural network (CNN) to address the issues of insufficient positioning information and low target detection accuracy. In this article, the target classification information and semantic location information are obtained through the fusion of the target detection model and the deep semantic segmentation model. Image features carrying different information are fused with a pyramid structure of multiscale image features and supplied to the classification and position parts of the target detection model, so that the model can use the matched fused features to detect targets of various sizes and shapes. According to experimental findings, this method's accuracy rate is 0.941, which is 0.189 higher than that of the LSTM-NMS algorithm. Through the transfer of the CNN and the learning of context information, this technique is highly robust and enhances the scene adaptability of feature extraction as well as the accuracy of moving target position detection.


Introduction
Images and videos have gradually grown in importance as a means of information transmission and acquisition due to the rapid advancement of computer technology. Manual analysis and retrieval is ineffective in the face of the massive amounts of video data created every day, and it is prone to visual fatigue and judgement errors from prolonged monotonous work. Moving object detection is regarded as one of the most fundamental tasks in video analysis and is a very practical and difficult topic in the field of computer vision research. Its technical principles draw extensively on artificial intelligence [1], pattern recognition [2,3], and computer vision [4]. Its commercial value and application prospects are reflected primarily in military applications, system monitoring, intelligent transportation [5], commodity inspection, and other scenarios. Traditional detection models typically employ complicated hand-crafted feature extraction techniques, such as the scale-invariant feature transform and the histogram of oriented gradients, to obtain a representation of the target from the original input, and then learn a classifier from the extracted features. Although target detection algorithms [6] have produced good results, traditional methods still have some issues because of the complex changes in natural scenes and the interference of man-made noise. For example, an algorithm that extracts only a single feature cannot reflect the influence of other elements; some features are computationally expensive; and because manual features are not universal, different scenes require different features to be designed, which demands a lot of work and considerable ingenuity. As a result, there is no single highly accurate detection method, and ongoing research is still required in the field of object detection in complex environments.
The CNN-based target detection algorithm has become the mainstream approach after years of development. Given this context, this article investigates the optimization of CNN-based moving target detection.
CNN is a network that combines the convolution operation with multilayer artificial neural networks (NNs) [7,8]. The computer learns the convolution kernels automatically and uses them to extract the desired features from the image, resulting in more universal and natural features. Additionally, CNN is robust to a certain amount of distortion. The CNN-based target detection algorithm has recently undergone significant development, greatly improving both its speed and detection accuracy. In the LeNet-5 CNN model, convolution and pooling layers are set alternately; they transform the input image into a collection of feature maps through a number of nonlinear transformations, and a fully connected NN then classifies the feature maps to complete image recognition. The network is trained under supervision with the back propagation algorithm. Target detection technology primarily addresses the difficult task of classifying, identifying, and localising targets in images or videos. Target detection in video not only identifies and locates the target but also has the task of tracking it. Applying CNN to target detection, so that image features with stronger representation ability are learned instead of being designed by hand from human prior knowledge, can greatly advance target detection technology. However, CNN needs a lot of sample data for training, and high-quality data sets are scarce. Therefore, when data are scarce, the model trained by CNN is often ineffective.
Additionally, the target recognition accuracy of early traditional CNNs is not very high because of their shallow hierarchy. In order to recognise moving targets, this article employs a convolution network model with a more complex structure. Its innovations are as follows: (1) The algorithm architecture in this article reduces the number of parameters as much as possible while preventing overfitting. At the same time, the model randomly samples around each pixel to obtain the patch features of neighboring pixels, so that the network is constrained by spatial context information, which improves the classification performance of the model. (2) This article proposes a prediction structure that fuses the semantic information of multilevel feature maps, addressing the issue of how to improve the expressive ability of features by fusing the detailed semantic information of low-level convolution features with the abstract information of high-level convolution features. In contrast to traditional moving target detection methods, the model suggested in this article is not constrained by a simple pixel model or by hand-crafted features, so it can better adapt to complex real-world application scenarios.

Related Work
Researchers have put forth a number of target detection techniques after years of investigation. Target detection techniques can be roughly categorised into two groups based on their technical approach: techniques based on template matching and techniques based on image classification. Lei et al. proposed a face detection technique based on a CNN structure. This technique has become a leading face detection method in face recognition systems because it is robust to all facial orientations, and the detection and identification system is used in real-world situations [9]. Roth et al. put forth a new CNN-based target detection model that adopts a component detection module, which reduces the amount of calculation by breaking the complex target into multiple distinct components for detection [10]. According to Tran and Hoang, because CNN alternately performs convolution and pooling operations on the input image, the features are gradually abstracted from low-level to high-level after numerous nonlinear transformations, giving them a strong capacity to express the target [11]. Han et al. proposed using SPP-Net to improve the detection speed and accuracy of RCNN. Spatial pyramid pooling solves the problem that the fully connected layer limits the input size, so the input image does not need to be cropped or scaled, which improves the detection accuracy. SPP-Net only needs to extract the forward CNN features of an image once, which significantly improves the detection speed [12]. Long et al. put forward the DenseNet classification model, in which multiple blocks are stacked in the network and each convolution layer in a block propagates part of its feature maps to the next convolution layer, so that the feature information of the image can flow better through the network [13]. Ruf et al. proposed applying CNN to pedestrian detection; in this method, the training samples are used to fine-tune the whole CNN in a supervised way [14]. Tang et al. used a recursive equation to dynamically update the weight of each Gaussian function, and the number of Gaussian functions used at each pixel can be adjusted as needed [15]. Andrearczyk and Whelan proposed combining the task of pedestrian detection with the task of learning context to optimize CNN by multitask training. The learned context information contains the attributes of pedestrians and scenes, which effectively assists CNN in pedestrian detection and reduces false positives [16]. Kumam and Bhatnagar proposed fusing multilayer feature maps before the target candidate boxes are generated to obtain multiscale hyper features, which enhances the feature expression of the target. HyperNet greatly improves the accuracy of target detection through more accurate candidate boxes and demonstrates the importance of candidate boxes to target detection [17]. Li et al. proposed another region-based method, which detects motion by using statistical cyclic shift moments in image areas [18].
In the past, the poor anti-interference ability of detection algorithms easily left large gaps in the detected targets. In this article, a new CNN-based optimization algorithm for moving target detection is put forth. The model uses a component detection module, which breaks the complex target down into components and detects each one separately to minimise the amount of calculation. The training model learns the classification rules of the targets after identifying the labels of hidden variables in unlabeled samples. Additionally, to speed up training convergence and raise target recognition accuracy, this article combines two fine-tuning strategies during training. The network is aided in finding the target object by extending the candidate box into a search area and using a variety of input areas, which improves the localisation accuracy of the detection result. According to extensive experiments on common data sets, the models proposed in this article perform excellently in moving target detection compared with current mainstream deep learning methods and traditional methods.

Application of NN in Moving Target Detection.
By examining the geometrical or statistical properties of the target, one can precisely identify and segment it from the image or video. With the ongoing development of computer hardware and Internet technology, researchers have gradually expanded their study of deep learning, using computers for data acquisition and training. CNN-based structures have emerged as the primary detection and identification technique considered by many researchers [19]. In the fields of computer vision and digital image processing, CNN, an effective artificial neural network built on the foundation of conventional artificial neural networks, is of great importance. Traditional computer systems have difficulty describing nonlinearity and building corresponding models in logical terms, whereas NN has inherent advantages in this area. By simulating the structure of human brain neurons, it has excellent learning capacity and can approximate complex nonlinear functions well. A deep neural network (deep NN) is essentially a multilayer perceptron network that performs an input-output mapping, and CNN is a type of deep NN. Its primary feature is the use of local connections and weight sharing, which on the one hand reduces the number of parameters to train and makes network optimization easier, and on the other hand prevents overfitting to some extent. The processing mechanism of the cat's visual system served as the model for CNN's multilayer network structure. Local receptive fields are tiled so that together they cover the entire visual field. CNN performs convolution on the input image to extract image features, using a convolution kernel whose values are shared across the local receptive fields, which further reduces the number of weights in the CNN model [20]. Deep learning theory is applied so that the convolution kernels are computed by the computer and the target features are extracted automatically from the input image. In addition, CNN is fairly resilient to a certain amount of distortion and transformation.
Building on these properties, various detection techniques have had excellent success in identifying and detecting targets. Convolutional network models use both forward propagation and back propagation. The forward propagation algorithm computes the output of the network without adjusting any parameters: each neuron calculates its output and passes it to the next layer until the output layer is reached, where the network's output is calculated. The loss function is minimised using gradient descent: the weight parameters of the network are adjusted using the back-propagated error, and iterative training continuously raises the network's accuracy [21]. The flow of a classification model is typically organised by depth. CNN extracts the image's features by processing the input data with multiple stacked modules of convolution layers, activation function layers, and pooling layers. It then feeds the features into the fully connected layer and the objective function to obtain the prediction loss, and finally updates the network parameters by back-propagating the loss of the objective function. Figure 1 displays the structure of NN.
The CNN model can take the original image directly as input, which reduces labor-intensive preprocessing of the original image, streamlines the workflow, and increases productivity. Weight sharing is a characteristic of convolution neurons, which lowers the number of parameters. The local receptive field limits the range of each connection, which also reduces the complexity of the model. The down sampling layer reduces the dimension of the extracted features. In CNN, which was created specifically for images, the features of a convolution layer are derived from the local features of the previous layer via weight sharing. When CNN's structure is properly adjusted, it can also provide the regression capabilities required by various applications. For feature extraction and compression of the input image, the CNN model typically places convolution layers and sampling layers alternately in the lower layers of the network. At the top of the CNN model, the fully connected layer aggregates the image features extracted by the layer-by-layer transformations into a feature vector, which serves as a new representation of the input image, and performs processing such as regression and classification on it. Because the convolution and down sampling operations are added to the basic structure of the artificial NN, the features extracted by CNN have some spatial invariance, and CNN is more suitable for image detection than traditional NN. At present, in the mainstream CNN image classification models, multiple convolution layers are stacked in each module to extract image features.
CNN introduces the convolution and sampling operations so that the computer can automatically learn the target features from the input image. It recognises different targets well and is robust to some degree of distortion and other changes. These advantages are based on its three main features: sparse connection, weight sharing, and sampling. Generally, the low-level structure of CNN consists of alternating convolution layers and pooling (sampling) layers, while the high-level structure consists of fully connected layers and the corresponding classifier. The input of the first fully connected layer is the feature map produced by the lower structure after feature extraction, and the last layer uses the classifier as the output layer. Logistic regression, Softmax regression, or a support vector machine can be used for the final image classification. With weight sharing, the same convolution kernel is applied across the whole image to generate a complete feature map. Under otherwise identical conditions, this greatly reduces the number of weight parameters required by the network structure, and it is one of the most significant advances of CNN. CNN also adopts a down sampling operation, which is located after the convolution layer. It down samples the features output by the convolution layer and integrates the information. This greatly reduces the number of hidden units between the input layer and the output layer and therefore the computational complexity of the network. Additionally, the model exhibits a degree of spatial invariance. By cascading multiple CNNs, images can be classified gradually from easy to difficult, which lessens the complexity of training.
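The effect of weight sharing on the parameter count can be illustrated with a small calculation. The following is a minimal sketch; the image size, kernel size, and number of units are hypothetical values chosen for illustration and are not taken from this article.

```python
def dense_params(in_h, in_w, in_c, hidden_units):
    """Weights plus biases of a fully connected layer on a flattened image."""
    return in_h * in_w * in_c * hidden_units + hidden_units


def conv_params(kernel_h, kernel_w, in_c, out_c):
    """Weights plus biases of a convolution layer; independent of image size."""
    return kernel_h * kernel_w * in_c * out_c + out_c


# Hypothetical sizes: a 32x32 RGB input and 100 output units / feature maps.
fc = dense_params(32, 32, 3, 100)   # every pixel connected to every unit
conv = conv_params(5, 5, 3, 100)    # one shared 5x5 kernel per feature map
print(fc, conv)
```

Under these assumed sizes the fully connected layer needs roughly forty times as many parameters as the convolution layer, and the gap widens as the input image grows because the convolution count does not depend on the image size.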

CNN Structure.
A conventional CNN model is made up of three components: the convolution layer, the sampling layer, and the fully connected layer. The convolution layer and the pooling layer are the central components of feature extraction. This section introduces the CNN model's convolution layer, activation function layer, pooling layer, fully connected layer, and target loss function, in that order. The convolution layer is crucial to the success of the CNN model because it is the primary structure for extracting image features. The sparse connection mechanism and the weight sharing scheme keep the number of parameters and the amount of calculation in the convolution layer within a reasonable range. In practice, CNN uses a two-dimensional discrete convolution. Each convolution kernel connects a layer to the one before it by performing a convolution over the local receptive fields of the preceding layer's feature maps. Convolution kernel parameters are connected to local pixels in the corresponding feature map and can be thought of as the weight parameters of a traditional NN. Each convolution layer in the CNN classification model is made up of multichannel convolution kernels, so the number of parameters grows significantly faster for large convolution kernels than for small ones. The calculation of each feature map in the convolution layer can be broken down into three steps: first, the convolution kernels are convolved with the corresponding feature maps of the previous layer; second, the convolution results and the associated bias (offset) are added up; finally, a nonlinear activation function is applied to the sum, and the result is a feature map of the convolution layer. After the convolution layers, the input image has been mapped to the hidden-layer feature space. The fully connected layer then acts as a classifier by mapping the hidden layer's feature representation into the sample label space.
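The three steps of computing a feature map (convolve, add the bias, apply the activation) can be sketched as follows. This is an illustrative single-channel example in plain Python, assuming no padding, a stride of 1, and ReLU as the activation; the input values and the kernel are hypothetical.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (no padding, stride 1), add bias, apply ReLU.

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Step 1: weighted sum over the local receptive field.
            s = sum(image[r + i][c + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            # Step 2: add the bias. Step 3: nonlinear activation (ReLU here).
            row.append(max(0.0, s + bias))
        out.append(row)
    return out


image = [[1, 2, 0],
         [0, 1, 3],
         [2, 0, 1]]
kernel = [[1, -1],
          [0, 1]]  # hypothetical 2x2 kernel
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0.0, 5.0], [0.0, 0.0]]
```

The same four kernel values are reused at every position of the image, which is the weight sharing described above.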
After the input has passed through the convolution layer, the result is transformed nonlinearly by the activation function, which is equivalent to appending a nonlinear layer to the convolution. The activation function therefore plays a vital role in deep CNN. The Sigmoid function is mathematically defined as follows:
$$f(x) = \frac{1}{1 + e^{-x}}. \tag{1}$$
Because the function curve is smooth, continuous, and monotonic, and because its value lies between 0 and 1, it can be viewed as a probability distribution function for classification problems and is frequently used as the threshold function of NNs. The output of the Sigmoid function is not zero-centred, a problem that is resolved by the Tanh function, which also converges faster than the Sigmoid function. Tanh's output is a nonlinear, monotonically increasing function of its input, which is consistent with the gradient solution of the BP network. It is described as follows:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}. \tag{2}$$
The ReLU function is defined as follows:
$$f(x) = \max(0, x). \tag{3}$$
When the input signal is less than 0, the output of the ReLU activation function is 0. When the signal is greater than 0, the output is equal to the input.
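The three activation functions discussed above can be written in a few lines. This is a minimal sketch using only the Python standard library; it uses the standard definitions of Sigmoid, Tanh, and ReLU and is intended for moderate input values.

```python
import math


def sigmoid(x):
    """Smooth, monotonic, output in (0, 1); not zero-centred."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Zero-centred counterpart of the Sigmoid, output in (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)


for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x))
```

At x = 0 the Sigmoid returns 0.5 while Tanh returns 0, which shows the difference in centring between the two functions.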
The CNN model's sampling layer, also referred to as the pooling layer, is crucial for lowering the computational complexity of the model and the resolution of image features. Because images are statistically stationary, features that are useful in one area of an image may also be applicable to other areas. The pooling layer aggregates statistics of the features in small areas of the previous feature map in order to describe the larger image. Max pooling takes the maximum value in each small area as the pooled value, and average pooling takes the average value of the pixels in each small area. The pooled value then replaces the corresponding small area of the previous layer on the feature map of the subsequent layer. The sampling layer typically comes after the convolution layer and operates statistically on the image features that the preceding convolution layer extracted. The sampling layer has no learnable parameters, so nothing in it is updated during back propagation. Its primary purpose is down sampling, which further reduces the number of parameters by discarding redundant samples from the feature map. The two most popular pooling techniques in the CNN model are max pooling and mean pooling, and CNN classification models generally employ max pooling. Through numerous convolution layers and pooling layers, CNN gradually transitions from low-level features to high-level features of the input image. The fully connected layer and the output layer classify the high-level features and produce a vector that represents the category of the input image. Figure 2 depicts the CNN organisational structure.
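The difference between max pooling and mean pooling can be seen on a small example. The sketch below assumes non-overlapping windows (stride equal to the window size); the feature map values are hypothetical.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window (stride = size)."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            if mode == "max":
                out_row.append(max(window))
            else:  # any other mode is treated as mean pooling
                out_row.append(sum(window) / len(window))
        pooled.append(out_row)
    return pooled


fm = [[1, 3, 2, 0],
      [4, 2, 1, 1],
      [0, 0, 5, 6],
      [1, 3, 7, 8]]
print(pool2d(fm, mode="max"))   # [[4, 2], [3, 8]]
print(pool2d(fm, mode="mean"))  # [[2.5, 1.0], [1.0, 6.5]]
```

Both variants reduce the 4 x 4 map to 2 x 2 and involve no trainable weights.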
The fully connected layer is usually placed at the tail of CNN. All neurons in adjacent layers are connected by weights, in the same way as neurons in a traditional NN: the input of each node in the fully connected layer is connected to every output node of the previous layer. The fully connected layer implements the classifier of the entire model, mapping the image features extracted by the preceding convolution and pooling layers to the label space of the samples. Its high-level features can be mapped according to the specific task of the output layer. The primary purpose of the fully connected layer is to combine the feature information of the two-dimensional feature maps into a one-dimensional feature vector, which the classifier then uses to classify the image. The output layer is the final structure of the deep learning network; it produces the classification output, and the image category is determined using the regression function. Traditional CNN has a huge number of parameters and a complex, redundant training process, which increases the computational overhead and affects network performance. This article improves on it. The Softmax classifier is shown as follows:
$$P(y = j \mid x) = \frac{e^{x_j}}{\sum_{k=1}^{M} e^{x_k}}, \qquad j = 1, \ldots, M. \tag{4}$$
In the formula, x is the feature vector output by the fully connected layer and M is the number of categories of the classification.
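A Softmax classifier over the output of the fully connected layer can be sketched as follows. This is a minimal illustration of the standard Softmax function with the usual max-subtraction trick for numerical stability; the score vector is a hypothetical example with M = 3.

```python
import math


def softmax(x):
    """Softmax over the M entries of the fully connected layer's output x."""
    m = max(x)  # subtracting the maximum leaves the result unchanged
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical output for M = 3 categories
probs = softmax(scores)
predicted = probs.index(max(probs))
print(probs, predicted)
```

The outputs are positive and sum to 1, so the index of the largest entry can be read as the predicted category.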

Construction of Moving Target Detection Model Based on CNN.
The hand-crafted feature modelling method extracts features from the original image using a manually designed feature extraction technique, feeds those features into a classifier so that it can learn the classification rules, and then lets the trained classifier find new targets. The main steps are information collection, preprocessing, feature extraction, feature selection, and classification training. During the production, transmission, and recording of video sequences, images are easily contaminated by noise, resulting in data loss. The image quality is degraded, and compression or transmission errors may even occur. Therefore, preprocessing must be carried out during target detection to reduce the interference of noise. Preprocessing consists of denoising the image to be detected, enhancing it, and transforming its color space. In the sliding window process, a window of fixed size is slid across the image to be detected, and the subimage inside the window is taken as a candidate area. The CNN model can take the original image directly as input, which avoids a large number of complicated preprocessing operations, simplifies the working steps, and improves efficiency. At the same time, CNN can recover much of the required information from incomplete or noisy input signals. In this article, preprocessing allows the final detection model to obtain features more in line with the characteristics of the target and improves the accuracy of target detection.
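The sliding window process described above can be sketched as a generator of candidate boxes. The image size, window size, and stride below are hypothetical values for illustration; the article does not state the values it uses.

```python
def sliding_windows(img_h, img_w, win_h, win_w, stride):
    """Yield (top, left, bottom, right) boxes of fixed-size candidate windows."""
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            yield (top, left, top + win_h, left + win_w)


# Hypothetical 128 x 96 image scanned with a 64 x 32 window and stride 32.
boxes = list(sliding_windows(img_h=128, img_w=96, win_h=64, win_w=32, stride=32))
print(len(boxes), boxes[0], boxes[-1])
```

Each box would then be cropped from the image and passed to the classifier as a candidate area. A smaller stride yields more candidates and therefore more computation.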

Computational Intelligence and Neuroscience
Median filtering is a common postprocessing method in motion detection, mainly used to eliminate outliers and smooth noise in foreground images. The output of the network is filtered by a spatial median filter, and each pixel value is then processed with a global threshold. Mean filtering is a linear filtering algorithm in the spatial domain. It convolves the image with a smoothing template and replaces the target pixel value with the average of the neighborhood around the template's central pixel, so as to remove noise from the image. The pixel (x, y) to be processed is replaced by the average of the pixels in the template, which gives its processed gray value g(x, y). The sum is divided by the total number of pixels in the template because, typically, a weighted average over the template is computed so that the gray values of the processed pixels remain consistent with those of the original pixels. In general, the median filter performs better than the mean filter because it preserves more of the image's feature information while maintaining the sharp edges inside the image. However, when there are many moving objects in the video, median filtering may produce a very rough background image, which has a significant negative effect on the detection model. In this article, the filtering algorithm in the convolution model adopts the following formula to describe the structural risk minimization function for training: The linear transformation function of the input image is shown in the following formula: The coefficient vector ω can be used to represent the weight values.
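The difference between mean and median filtering can be illustrated on a single noisy pixel. The sketch below assumes a 3 x 3 template and leaves border pixels unchanged; the image values are hypothetical.

```python
from statistics import median


def filter3x3(img, reducer):
    """Apply `reducer` to each 3x3 neighborhood; border pixels are unchanged."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = reducer(window)
    return out


def mean(window):
    """Sum of the template pixels divided by their total number."""
    return sum(window) / len(window)


img = [[10, 10, 10],
       [10, 255, 10],   # a single salt-noise pixel
       [10, 10, 10]]
mean_out = filter3x3(img, mean)[1][1]
median_out = filter3x3(img, median)[1][1]
print(mean_out, median_out)
```

The median filter restores the outlier to the background value of 10, whereas the mean filter only spreads it to about 37, which is consistent with the observation that median filtering removes outliers more effectively.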
In a target detection task that uses a correlation filter, the feature sample of the image is usually used as the input variable, and the kernel function involved is shown in the following formula: The input image is then examined with the filter β obtained from training, and its response is calculated as shown in the following formula: After the response value of the image feature has been calculated, the filter can be updated on the matched image sample feature. The information passed backward is the gradient of the error E with respect to the weight parameter W, ∂E/∂W, and with respect to the bias parameter b, ∂E/∂b. The weight parameter W and the bias parameter b are adjusted according to these two gradients. In the simplest case, the update formulas for both are as follows:
$$\Delta W = -\eta \frac{\partial E}{\partial W}, \qquad \Delta b = -\eta \frac{\partial E}{\partial b}.$$
In the formula, ∆W is the update amount of the weight parameter W, ∆b is the update amount of the bias parameter b, and η is the learning rate. The model proposed in this article uses the momentum formula to calculate the update of the parameters as follows: where μ is the momentum coefficient. The added momentum term reduces the step size in directions of high curvature, thereby indirectly increasing the effective learning rate in directions of low curvature and improving the convergence speed of the learning algorithm. Three steps make up the CNN's training process: forward calculation, backward calculation, and parameter updating. Correct gradient calculation is the key to the back propagation algorithm: the gradient of the objective function with respect to each layer's input and parameters must be calculated. The gradient with respect to each layer's input is computed so that the gradient with respect to the previous layer's parameters can be obtained more easily by applying the chain rule.
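The plain gradient update and a momentum update can be compared with a short sketch. Since the article's own momentum formula is not reproduced here, the code assumes the classical momentum form, in which the new update equals μ times the previous update minus η times the gradient; the toy objective and all numeric values are hypothetical.

```python
def sgd_step(w, grad, lr):
    """Plain gradient descent: delta = -lr * grad."""
    return w - lr * grad


def momentum_step(w, velocity, grad, lr, mu):
    """Classical momentum (assumed form): delta = mu * previous_delta - lr * grad."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity


# Toy objective E(w) = w**2 with gradient dE/dw = 2w.
w, v = 1.0, 0.0
for _ in range(3):
    w, v = momentum_step(w, v, grad=2 * w, lr=0.1, mu=0.9)
print(w, v)
```

After three steps the momentum variant has moved from 1.0 to about 0.062, whereas three plain steps with the same learning rate reach 0.512, which illustrates the faster progress along a consistent gradient direction.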
During training, the loss between the predicted category and the actual category is calculated according to the loss function, and the parameters of the network are updated by back propagation. During testing, the detection boxes are categorised using the Softmax function. In this article, the stride of the convolution kernel is increased in some convolution layers to replace the down sampling function of the max pooling layer in the original network model, which reduces the complexity and the amount of calculation of the model and improves the running efficiency of the model based on depthwise separable convolution. In addition, this model adopts an adaptive learning rate reduction method. It uses a larger initial value to accelerate convergence, and in the subsequent learning process the learning rate is adjusted according to the difference between the current cost function and the previous cost function: if the difference is not significant, the current learning rate is halved and used as the new learning rate. The details are shown in Table 1.
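The adaptive learning rate rule can be sketched as follows. The article does not state how small the difference must be to count as not obvious, so the threshold `tol`, the initial learning rate, and the cost values below are assumptions made for illustration.

```python
def update_learning_rate(lr, prev_cost, curr_cost, tol=1e-3):
    """Halve lr when the cost changed by less than `tol` (assumed threshold)."""
    if abs(prev_cost - curr_cost) < tol:
        return lr / 2.0
    return lr


costs = [1.00, 0.60, 0.5995, 0.40, 0.3999]  # hypothetical cost per round
lr = 0.01
history = [lr]
for prev, curr in zip(costs, costs[1:]):
    lr = update_learning_rate(lr, prev, curr)
    history.append(lr)
print(history)  # [0.01, 0.01, 0.005, 0.005, 0.0025]
```

The learning rate stays large while the cost is still falling quickly and is halved each time the cost stalls.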

Result Analysis and Discussion
In this article, the INRIA human database was used to evaluate the multicomponent detection of complex targets. This database contains images of standing pedestrians in different postures and background environments, collected from the GRAZ01 data set, personal digital images, web images, and other sources. Each pedestrian in an image is at least 100 pixels tall. 85% of the samples are used as the training set and 15% as the test set. With a suitable software platform, a graphics card provides parallel accelerated computing for the CNN model. The graphics card used in this article is an NVIDIA GeForce GTX 1080 GPU. The operating system is Ubuntu, the GPU graphics card is an NVIDIA Quadro K6000, the memory is 64 GB, and the deep learning framework is TensorFlow. The software platforms used are Caffe2 and OpenCV 3.2, and the programming languages are C++ and Python. In the training stage of the improved CNN model, the mini-batch size is 256, the ratio of positive to negative samples is 1 : 3, the initial learning rate is 0.0001, and training runs for 10 k rounds. The improved CNN model uses the Softmax function to classify the categories of detection boxes in the testing phase. Table 2 shows the test results on the insulator data set.
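The data split and the mini-batch composition stated above translate into simple counts. The sketch below only does this arithmetic; the total of 1000 samples is a hypothetical figure, since the size of the data set is not given here.

```python
def split_counts(n_samples, train_fraction=0.85):
    """Number of training and test samples for an 85% / 15% split."""
    n_train = round(n_samples * train_fraction)
    return n_train, n_samples - n_train


def batch_composition(batch_size=256, pos=1, neg=3):
    """Positive and negative samples per mini-batch for a pos : neg ratio."""
    n_pos = batch_size * pos // (pos + neg)
    return n_pos, batch_size - n_pos


print(split_counts(1000))     # hypothetical data set of 1000 samples
print(batch_composition())    # mini-batch of 256 with a 1 : 3 ratio
```

With a mini-batch of 256 and a 1 : 3 ratio, each batch holds 64 positive and 192 negative samples.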
In practice, specialised target detection and scene detection tasks are common, but the available data sets are limited, and the detection tasks cannot be completed when data are insufficient. Therefore, in this article, data enhancement is used to mitigate the problem of insufficient data. It effectively improves the target recognition rate and is well suited to CNN model training. The comparison before and after data enhancement is shown in Figure 3.
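As an illustration of data enhancement, the sketch below generates extra training images from one original. The article does not say which transformations were used, so horizontal flipping and brightness scaling are assumptions chosen only as common examples, and all function names are hypothetical. Images are represented as plain lists of rows of grayscale values.

```python
def horizontal_flip(image):
    """Mirror a 2D grayscale image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def adjust_brightness(image, factor):
    """Scale pixel intensities by factor and clip them to the 0-255 range."""
    return [[max(0, min(255, int(round(p * factor)))) for p in row]
            for row in image]


def augment(image, brightness_factors=(0.8, 1.2)):
    """Return the original image plus flipped and brightness-shifted copies."""
    variants = [image, horizontal_flip(image)]
    for factor in brightness_factors:
        variants.append(adjust_brightness(image, factor))
    return variants
```

For detection data, any geometric transform such as the flip would also have to be applied to the bounding box coordinates.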
ROC curves were used to represent the experimental results of each foreground detection method visually. The true positive rate (TPR) and false positive rate (FPR) are calculated, and the ROC curve is then drawn from them. The ROC curves of each algorithm are shown in Figure 4. The closer the curve is to the vertical (TPR) axis, the lower the false detection rate of the experimental results; the closer it is to the horizontal (FPR) axis, the higher the false detection rate.
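The TPR and FPR calculation behind an ROC curve can be sketched as follows. This is a generic illustration, not the evaluation code used in the article; the function name and the convention that a sample counts as foreground when its score is at or above the threshold are assumptions.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over scores.

    labels are 1 for foreground (positive) and 0 for background (negative).
    Assumes at least one positive and one negative label.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold step by step so more samples count as foreground.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

Plotting these pairs with FPR on the horizontal axis and TPR on the vertical axis gives the ROC curve.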
As can be seen from Figure 4, the interframe difference method has the worst detection performance on the data set, with many false detections in complex scenes containing global illumination changes and shadows. Compared with the interframe difference method, the Vibe algorithm gives better detection results and a lower false detection rate. The algorithm described in this article has the best detection performance of all of them.
To address the issue of identifying small targets, this article notes that background-filled images not only contain no target information themselves but also affect the training results. The detection performance on small targets improves if the small targets are processed centrally. The images are therefore joined together, and the joining is performed in a non-fixed mode.
The precision rate and the recall rate are typically used in target detection to assess the effectiveness of the detection model. Simply put, the recall rate is the proportion of correctly identified objects among all the objects that needed to be detected. The precision rate, referred to here as the accuracy rate, is the percentage of correctly predicted targets among all suggestion boxes produced by the target detection model. Figure 5 displays the accuracy test results for the various algorithms, and Figure 6 displays the recall test results. In addition, this article uses the DET (Detection Error Trade-off) curve to show the global detection performance of the model. The DET curve plots the missed detection rate against the false positive (type I error) rate per window on a logarithmic scale to evaluate the detection performance of the model. Figure 7 shows the DET curves of the different models.
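The precision, recall, and missed detection rate defined above follow directly from the counts of true positives, false positives, and false negatives. The sketch below is a generic illustration; the function names are hypothetical and the counts in the usage example are made up, not results from the article.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from detection counts.

    precision: correct detections over all predicted boxes.
    recall: correct detections over all objects that should be detected.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


def miss_rate(true_positives, false_negatives):
    """Return the missed detection rate, which equals 1 - recall."""
    actual = true_positives + false_negatives
    return false_negatives / actual if actual else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(precision_recall(80, 20, 10))
    print(miss_rate(80, 20))
```

The missed detection rate computed here is the quantity on the vertical axis of a DET curve.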
In the DET graph, the lower the curve, the better the detection performance of the model on the target. Figure 7 shows that the model proposed in this article outperforms the comparison models in overall detection performance. The experimental results demonstrate that the model presented in this article is more effective at detecting complex targets than some of the main detection models currently in use. The dynamic background case is tested experimentally using the dynamic background data set in Change Detection. Table 3 displays the performance evaluation findings for each algorithm and shows that the statistical results agree with the experimental detection results. The algorithm proposed in this article has a lower false detection rate and a higher accuracy than the comparison algorithm. Because it extracts moving objects better, the improved method outperforms the comparison algorithm in these experiments. The experimental results in this section show that the accuracy of this algorithm is 0.941, which is 0.189 higher than that of the LSTM-NMS algorithm. By migrating the CNN and learning context information, this method improves the scene adaptability of feature extraction and the accuracy of moving target location detection. It also accelerates the convergence of training and is strongly robust.

Conclusions
Target detection is very difficult, with two main challenges:
(1) For moving targets, scale differences, local occlusion, pose changes, and other factors significantly alter the targets' appearance, leading to false positives in target detection.
(2) The more complex the scene, the harder it is to tell the target from the nontarget, leading to false positives and false negatives in target detection. Within the scene, the appearance of the target is also deformed by factors such as changes in illumination and viewing angle.
This article develops an optimised model of moving target detection based on CNN to address the issues of insufficient positioning information and low target detection accuracy. The target classification information and semantic location information are obtained through the fusion of the target detection model and the depth semantic segmentation model. To improve the expressive ability of features by fusing the detailed semantic information of low-level convolution features with the abstract information of high-level convolution features, this article proposes a prediction structure that fuses the semantic information of multilevel feature maps. In contrast to traditional moving target detection methods, the proposed model is not constrained by the simple pixel model or by artificially designed features, so it can better adapt to complex application scenarios in the real world. According to the experimental findings, this algorithm's accuracy rate is 0.941, which is 0.189 higher than that of the LSTM-NMS algorithm. The model reduces computation and gets around the issue of insufficient parameter learning. It also achieves the desired result and displays improved robustness to noise and shadow interference. The work has both commercial and practical value. In future research, we will further improve the testing speed of the model by optimising the learning network, the suggestion box information, and the feature information of the detection box.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.